1 Introduction

The built environment embodies neighborhood change in many ways. Hidden in house sale transactions are considerations of personal values and spatial processes that affect demand for housing. A few such factors are the transportation cost to stores or work, the perceived safety of a neighborhood, the distance to schools, and the amount of natural land within an area. Local knowledge gives people a frame of reference for these considerations when choosing where to move. In addition, the housing stock and the locations available on the market introduce further limitations that affect house prices and can price people out or divert capital to another part of town. Macro-level trends in crime, culture, and development also influence spatial processes and homebuyers' decisions. Together, these multiple scales of factors direct where people choose to live and what types of homes they live in. The sum of individuals interacting with the housing market based on these considerations drives changes in sale price over time, as lower-value homes are fixed and flipped, supply in neighborhoods declines, disinvestment takes place, and more. On the other hand, as new housing becomes available at different income levels, people move and open up their old homes for new residents. Thus, neighborhood change is always taking place, and the question that we ask in this project is: can we predict future home sale prices across Philadelphia based on recent data?

Zillow, a go-to source for prospective home buyers interested in its "Zestimate" home value predictions, does this with its own model, but the Zestimate exhibits errors, most likely because the model does not account for local context.

Our group set out to develop a model that predicts home sale prices in Philadelphia from a variety of existing data. We used a machine learning framework that explored different characteristics related to home values to construct a model that is both accurate and generalizes to different contexts. Our model relies heavily on ordinary least squares (OLS) regression, which gives us an equation for the "line of best fit" (LBF) describing how different variables relate to each other and to the observed sale prices in our data. We then use the LBF equation to predict home prices and interpret our measurements of error to improve our model. This report features visualizations and descriptions that highlight different aspects of our model-creation process, our analysis, a discussion of our results, and a conclusion with our recommendations for Zillow.

1.1 Data Sources

To build our model, we were given a dataset of homes sold in Philadelphia in 2022-2023 as well as their internal characteristics, including number of bedrooms, total livable area, and interior condition.

We also wrangled open data from a variety of sources to build our predictor variables.

  1. American Community Survey 5-year census tract-level estimates, 2021 (via the tidycensus package in R)
  2. Tree point data (2015) from Philadelphia Parks & Recreation (via OpenDataPhilly)
  3. Philadelphia neighborhood boundaries from Azavea
  4. Commercial corridors from the City of Philadelphia (via OpenDataPhilly)
  5. Building and zoning permits (2018-present) from Philadelphia Department of Licenses and Inspections (via OpenDataPhilly)
  6. Shootings from Philadelphia Police Department (via Carto)

After joining these data to our dataset of homes, a number of homes had missing data values. In an effort to avoid exposing our model to serious risks of bias, we imputed mean values for missing observations in the variables used to construct our model.
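As an illustrative sketch of this imputation step (our wrangling was done in R; the pandas snippet below uses made-up values, with column names echoing Table 1):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the joined home-sale data;
# the column names echo Table 1 but the values are purely illustrative.
homes = pd.DataFrame({
    "number_of_bedrooms": [3.0, 2.0, None, 4.0],
    "MedRent": [900.0, None, 1100.0, 1000.0],
})

# Replace each missing observation with its column's mean.
homes_imputed = homes.fillna(homes.mean(numeric_only=True))
```

Mean imputation keeps every sale in the dataset at the cost of shrinking the variance of the imputed variables slightly, which we judged preferable to dropping observations.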

1.2 Summary statistics

One of the first steps to creating our model is to investigate our predictor variables and their relationships to our dependent variable: house sale price. We also use these summary statistics to get to know our data and the housing stock of Philadelphia on a preliminary basis.

Table 1: Descriptive Statistics for Homes Sold in Philadelphia
==========================================================
Statistic            N     Mean   St. Dev.  Min     Max
----------------------------------------------------------
interior_condition 23,881   3.6     0.9     0.0     7.0
number_of_bedrooms 23,881   2.6     1.3     0.0    31.0
total_area         23,881 1,828.9 3,813.0   0.0  226,512.0
total_livable_area 23,881 1,333.6  548.8     0    15,246
Age                23,881  86.3     51.1    -1     2,023
MedRent            23,881  990.1   292.4   249.0  2,339.0
ICE_tract          23,881  0.03     0.3    -0.6     0.7
trees_per_sqmi     23,881 1,854.3 1,694.1  55.5   8,669.5
permit_nn          23,881  430.7   273.8    0.0   3,429.5
gsw_qtrmile        23,881  41.8     56.7     0      488
----------------------------------------------------------

1.3 Data Visualization

It is best to start with a visualization of home sale prices across Philadelphia, which is what our model is aiming to predict. Below is a map that shows the distribution of home sale prices across Philadelphia since May 24th, 2022.

Because there are visible clusters of prices in different parts of the city, there are likely spatial processes influencing home sale price outcomes. Therefore, in our exploratory process, we wrangled variables with spatial components. A key goal of any prediction model is to be accurate, meaning that when we plot the difference between what is observed and what our model predicts, we should see low error values. Furthermore, our model must be generalizable across space, and errors should be evenly distributed when we map them.

1.3.1 Correlation Matrix

The following correlation matrix was created using the variables in our dataset. It is an important part of the exploratory data analysis process: it visualizes the variables' pairwise relationships and facilitates the selection of what to include in our model.

The relationship between variables can be positive or negative, and the strength of the relationship is indicated by the saturation of the colors. Every variable is perfectly correlated with itself, which is why the diagonal of the matrix is uniformly strong. A strong relationship between two different predictor variables can indicate "collinearity," which introduces error into our model when the two variables carry much the same information. In these cases, we carefully choose only one of them.
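A correlation matrix of this kind can be sketched as a table of pairwise Pearson correlations (illustrative Python/pandas stand-in for our R workflow, with toy values):

```python
import pandas as pd

# Toy numeric frame standing in for a few of our variables
# (the values are illustrative, not from our dataset).
df = pd.DataFrame({
    "sale_price":         [100, 200, 300, 400],
    "total_livable_area": [800, 1500, 2100, 2900],
    "trees_per_sqmi":     [50, 40, 70, 90],
})

# Pairwise Pearson correlations; the diagonal is always exactly 1,
# and off-diagonal values near +/-1 flag potential collinearity.
corr = df.corr()
```

Plotting `corr` as a heatmap reproduces the saturation-coded matrix described above.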

1.3.2 Home Price Scatterplots

After assessing the matrix above, we created the following scatterplots to further explore the relationships of potential predictor variables to sale price values. They are: the racial and economic index of concentration at the extremes (ICE) at the tract level, trees per square mile, the average distance to the nearest 5 new construction permits, and the total living area of each home. Each scatter plot shows the “line of best fit” for reference, with the slope of the line indicating the magnitude of their correlation.

The total living area of a home stands out as having a positive correlation to sale price, which makes sense given that space is a large part of the utility value considered when purchasing a home. The average distance to the 5 nearest new construction permits also shows an interesting relationship: as distance increases, sale price drops. We believe this is a function of value-creation that takes place when permits are approved and landowners are able to inject capital into their property which is then recovered at the time of sale.

The other two variables have weaker positive correlations that are worth mentioning. It seems that people are willing to pay more to be in areas that have trees, or most likely areas near parks and nature. There is also a socioeconomic basis for sale price outcomes.

The ICE metric was developed by Massey (2001) and elaborated on by Krieger and colleagues (2016) to capture “extremes of privilege and deprivation” as well as measure both economic and racial segregation. It is calculated using the following formula: \[ICE=\frac{(A-P)}{T}\] Where \(A\) represents the number of households belonging to the privileged extreme, while \(P\) is the number of households who belong to the deprived extreme in a given area. \(T\) is the total number of households in that area.

ICE can take any value between -1 and 1; a value of 1 indicates that all residents are in the “privileged” group, and a value of −1 indicates that all residents are in the most “deprived” group. For this model, Krieger and colleagues’ definitions will be used: the privileged group is defined as White households with incomes over 100,000 USD, and the deprived group is defined as Black households with incomes below 25,000 USD.
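The ICE formula is simple to compute per tract; here is a minimal sketch with hypothetical household counts:

```python
def ice(privileged: float, deprived: float, total: float) -> float:
    """Index of Concentration at the Extremes: (A - P) / T."""
    return (privileged - deprived) / total

# A hypothetical tract with 300 privileged households, 100 deprived
# households, and 1,000 households in total:
score = ice(privileged=300, deprived=100, total=1000)  # -> 0.2
```

The two boundary cases behave as described above: a tract made up entirely of privileged households scores 1, and one made up entirely of deprived households scores -1.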

As ICE score increases, so does home sale price. Furthermore, the vast majority of homes with sale prices over $1 million are in tracts that have more privileged households than deprived ones. This pattern of disinvestment is a key determinant of racial and economic segregation.

1.4 Maps of Selected Variables

The following maps visualize a selection of predictor variables that we believe affect home sale prices. Given that we are trying to identify spatial processes involved in the outcomes of our observed data, these variables should form clusters when mapped. The data are split into quintiles, labeled 1-5, with 1 being the lowest and 5 the highest fifth of values.

1.4.1 Mean Price Lag by Census Tract

We first present the mean "price lag" of all sales in each census tract. Our price lag calculation locates, for each sale in our dataset, the five nearest sales, and then averages their values. The result is a metric that partially accounts for the way that homes are evaluated and influence each other's sale prices at the time of appraisal. The map shows clear clusters of high values in Northwest Philadelphia, as well as in and around Center City. Similarly, there are clusters of low values concentrated in North, West, and Southwest Philadelphia. Although this map provides insight into the spatial distribution of prices, it does not explain what influences them. It could be a matter of the characteristics of the existing housing stock, or the amenities available in the area. Our next maps shed some light on this question.
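The price lag amounts to a nearest-neighbor average. The function below is an illustrative Python stand-in for our R feature engineering, using toy coordinates and prices:

```python
import numpy as np

def price_lag(coords, prices, k=5):
    """For each sale, average the prices of its k nearest *other* sales."""
    coords = np.asarray(coords, dtype=float)
    prices = np.asarray(prices, dtype=float)
    # Pairwise Euclidean distances between all sales (n x n).
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude each sale itself
    nearest = np.argsort(d, axis=1)[:, :k]   # indices of k nearest sales
    return prices[nearest].mean(axis=1)

# Four toy sales on a line; a nearest-2 version for illustration.
coords = np.array([[0., 0.], [1., 0.], [2., 0.], [10., 0.]])
prices = np.array([100., 200., 300., 400.])
lag2 = price_lag(coords, prices, k=2)
```

In practice one would use projected coordinates (feet or meters) and a spatial index rather than the full distance matrix, but the averaging step is the same.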

1.4.2 Mean Distance to Corridors by Census Tract

This map shows the mean distance from each sale to a commercial corridor in Philadelphia. It nearly has an inverse relationship with the map above, suggesting that a part of the reason that higher sale prices cluster is because they are closer to commercial areas. Commercial areas are usually accessible via public transit, and some commercial areas double as significant transportation nodes. We know from previous work in this course that people are willing to pay more to be near public transportation; thus, the distance to commercial corridors as a factor of sale price in our dataset incorporates this demand into our model.

Lastly, it should be noted that there are areas with both higher sale prices and greater distances to commercial corridors. These areas tend to be on the outskirts of the city, suggesting that there might be lifestyle and/or neighborhood factors affecting the consumption of housing in these areas. It is possible that in these parts more people rely on cars to get to commercial corridors and being within walking distance to them holds less value. Conversely, there are areas near commercial corridors that show low-end sale prices, which implies that there are neighborhood and housing stock factors influencing the consumption of housing in these areas. It also implies that distance to commercial corridors is not by itself a significant predictor of sale price.

1.4.3 Average Gunshot Wounds within a Quarter Mile of Each Sale per Census Tract

The average count of shooting victims per sold home is represented in the map below. There are clusters of high counts of reports in North, West, and Southwest Philadelphia, as well as in Grays Ferry and Point Breeze in South Philadelphia. In these areas, there appears to be an inverse relationship with the sale price lag: prices are lower where there are more reports of gunshot victims. For newcomers and long-time residents alike, a city's image is greatly influenced by the perception of safety in its neighborhoods, and that perception is often tied to the frequency of violent events.

We would also like to note a local contextual factor worth accounting for: territories of change. While some parts of Philadelphia show inverse relationships, South Philadelphia has a mixture of gunshot wound counts despite having higher average prices. There could be changes happening at a block-to-block and parcel-to-parcel level as the area undergoes a stage of gentrification. This might make it harder to predict prices in areas defined by increasingly widening rent gaps.

1.4.4 Average Distance to Five New Construction Permits of Each Sale per Census Tract

This visualization was created using the cleaned permit data that we brought in. We posited that the distance to the nearest 5 permits could account for processes of change that affect sale price. Theoretically, if a home is near new construction permits, then its price will be affected for a number of reasons. One is that new construction has a cost that must be recovered at the time of sale by the developer/seller, raising the sale price for a lot that could have been bought for half that amount. Another reason is that, by accounting for permits, perhaps we can better predict sale prices in areas that have rent gaps. A rent gap is a spread in going sale prices that is created when prospective home buyers or investors inject capital into a project at the parcel level in historically disinvested areas of the city. Sale prices may also be higher in areas with high concentrations of permits because permits reflect the city's interest in raising property values through investment, which in turn increases its revenue from property taxes. With this in mind, it is unsurprising to see concentrations of permits in the high-priced Center City neighborhood, north of Center City in neighborhoods with lower mean price lags and higher counts of gunshot wounds, and in the highly gentrified Fishtown area.

2 Methods

Upon choosing our variables for regression, we randomly partitioned our dataset of homes into "training" and "test" sets by neighborhood to obtain an equal distribution of neighborhoods in our partitions. We then used the training set to create our line of best fit using R's linear model function, which returns an object summarizing how each variable contributes to sale price outcomes when we control for the other variables. Each of our variables is returned with a "coefficient," which says that as that variable increases by one unit, the predicted sale price increases or decreases by that value.
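The fit itself amounts to ordinary least squares on a design matrix with an intercept column. The sketch below simulates toy data (not our actual variables) and recovers the coefficients with plain least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: livable area and bedroom count predicting sale price,
# with coefficients chosen for illustration only.
n = 200
area = rng.uniform(500, 3000, n)
beds = rng.integers(1, 6, n).astype(float)
price = 50_000 + 160 * area + 4_000 * beds + rng.normal(0, 10_000, n)

# Design matrix with an intercept column; OLS via least squares.
X = np.column_stack([np.ones(n), area, beds])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# R^2: share of price variance explained by the fitted line.
resid = price - X @ coef
r2 = 1 - resid.var() / price.var()
```

On this simulated data, `coef[1]` and `coef[2]` land close to the 160 $/sqft and $4,000/bedroom used to generate the prices, which is exactly what the coefficient summary of the fitted model reports for the real data.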

We also interpret the results of our summary statistics to see how well our model performs; that is, how much of the variance in sale price it can explain. We use the value of \(R^2\), which measures the proportion of variation in the dependent variable (sale price) that is explained by the linear combination of our selected features. We interpret \(R^2\) on a percentage basis. In addition, for each of the variables in our regression, the summary calculates a p-value, which is a measurement of that variable's reliability and statistical significance in our model. When a p-value is lower than .05, for example, it means we can be at least 95% confident that the variable's relationship with sale price is not due to chance, making it a useful predictor.

Predicting the test dataset using the line-of-best-fit equation from our OLS regression on the training data is a good way to gauge the accuracy and generalizability of our model. We analyze the magnitude of our errors to understand the extent of the deficiencies of our model, as well as where on the map they appear. From our predicted values, we calculate the mean absolute error (MAE) and the mean absolute percent error (MAPE). MAE is the average of the absolute differences between predicted and observed values across all observations. MAPE is calculated by taking the absolute value of the difference between our predicted value and the observed sale price, divided by the predicted value, averaged across all observations. Together, these values give us insight into how well our model works on new data and for what range of sale prices it fails to predict.
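The two error metrics, as defined above (note that our MAPE divides by the predicted value), can be sketched with toy numbers:

```python
import numpy as np

def mae(observed, predicted):
    """Mean absolute error, in dollars."""
    return np.mean(np.abs(np.asarray(predicted, float) - np.asarray(observed, float)))

def mape(observed, predicted):
    """Mean absolute percent error; per the definition used in this
    report, the absolute error is divided by the *predicted* value."""
    observed = np.asarray(observed, float)
    predicted = np.asarray(predicted, float)
    return np.mean(np.abs(predicted - observed) / predicted)

# Three hypothetical sales, purely for illustration.
obs  = np.array([100_000., 200_000., 400_000.])
pred = np.array([110_000., 180_000., 400_000.])
```

For these toy values, `mae(obs, pred)` is $10,000 and `mape(obs, pred)` is about 6.7%.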

We want our model to be robust, so we use a process called "k-fold" cross-validation to partition our data into \(k\) pieces, then use each piece in turn as a test set while the rest are used to train a model to predict its values. Each of these test-training splits represents a fold. The cross-validation process then summarizes the mean error across all of the folds. We also analyze the distribution of across-fold MAE results to understand whether our model can accurately predict house prices for new data with little error.
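A minimal sketch of the k-fold procedure (illustrative Python rather than the R tooling we used):

```python
import numpy as np

def kfold_mae(X, y, k=5, seed=0):
    """Fit OLS on k-1 folds, score MAE on the held-out fold; one MAE per fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    maes = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        pred = X[test_idx] @ coef
        maes.append(np.mean(np.abs(pred - y[test_idx])))
    return np.array(maes)

# Toy exactly-linear data: every fold should predict almost perfectly.
X = np.column_stack([np.ones(50), np.arange(50.0)])
y = 3.0 + 2.0 * np.arange(50.0)
maes = kfold_mae(X, y, k=5)
```

The mean and standard deviation of `maes` correspond to the across-fold summaries we report in the Results section.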

If our model returns significant errors, it will most likely be due to spatial factors. We look for clusters of errors both in a scatterplot and on a map to diagnose where we are having problems. Clustering is also known as spatial autocorrelation: the idea that things near each other are more related than things farther apart, also known as the first law of geography. To measure whether we are indeed seeing clusters of errors in our data, we calculate Moran's I and map our errors.
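The permutation version of the Moran's I test we use can be sketched as follows; the tract layout and weight matrix here are toy stand-ins:

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I of `values` under spatial weight matrix W."""
    z = values - values.mean()
    return (len(values) / W.sum()) * (z @ W @ z) / (z @ z)

def moran_permutation_test(values, W, n_perm=999, seed=0):
    """Observed Moran's I plus a one-sided pseudo p-value: the share of
    random relabelings whose I is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, W)
    sims = np.array([morans_i(rng.permutation(values), W)
                     for _ in range(n_perm)])
    return observed, (1 + (sims >= observed).sum()) / (1 + n_perm)

# Toy example: eight tracts on a line, low errors on one side and
# high errors on the other, i.e. a clearly clustered pattern.
errors = np.array([1., 1., 1., 1., 10., 10., 10., 10.])
W = np.zeros((8, 8))
for i in range(7):                       # adjacent tracts are neighbors
    W[i, i + 1] = W[i + 1, i] = 1.0
observed_i, p_value = moran_permutation_test(errors, W)
```

For this deliberately clustered toy pattern, the observed I is strongly positive and the pseudo p-value is small, which is the same signature we look for in our real test-set errors.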

Our analysis is rounded out with a test of generalizability. We split our data by tracts with percentages of White people greater and less than the city-wide average, respectively. We calculated the MAE and MAPE of their respective models, then mapped them. Based on these results, we determined if our model had enough predictive power to generalize to new contexts. The results are discussed below.

3 Results

3.1 Training Model Regression Results

The following is a summary of the results of our OLS regression on sale price using the listed variables and our training dataset.

Table #: LM of Training Data
=================================================
                         Dependent variable:
                    -----------------------------
                             sale_price
-------------------------------------------------
total_livable_area           159.980***
                               (1.979)

total_area                    1.529***
                               (0.273)

interior_condition         -36,719.690***
                             (1,067.908)

number_of_bedrooms          3,826.189***
                              (733.596)

dist_to_commerce              -4.052***
                               (1.535)

price_lag                     0.554***
                               (0.007)

ICE_tract                   56,260.270***
                             (5,047.979)

permit_nn                     23.410***
                               (4.249)

trees_per_sqmi                17.530***
                               (0.793)

MedRent                       25.919***
                               (4.947)

gsw_qtrmile                  -82.179***
                              (19.323)

Constant                   -31,237.330***
                             (7,665.532)

-------------------------------------------------
Observations                   21,471
R2                              0.721
Adjusted R2                     0.721
Residual Std. Error   128,307.300 (df = 21459)
F Statistic         5,036.001*** (df = 11; 21459)
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01

The constant in our table is the predicted sale price when all other variables are zero. The way that we interpret the rest of the table is quite simple for the purposes of analyzing our model. All of our independent variables are reliable predictors of sale price: the p-values marked with three asterisks confirm that it is highly unlikely that there is no relationship between our dependent variable and the predictors. The values marked with asterisks are the coefficients of the independent variables, interpreted as the change (in dollars) in sale price per unit of change in that independent variable, with the standard error in parentheses below. Lastly, the \(R^2\) statistic measures the proportion of variation in sale price that is explained by the combination of our features.

3.2 Cross-Validation

We performed a 100-fold cross-validation (CV) test with our model. Here we present a histogram and summary statistics of our results. The average MAE tells us what our error was, on average, in each fold. The standard deviation (SD) of the MAE across all folds tells us how well our model generalizes to new data: if the SD is high, there is too much variation across folds and our model is not generalizing well, producing a wide range of errors. When it is low, as in our case, the distribution of errors is clustered together and our model passes the test of generalizability. This clustering of errors is also reflected in the histogram: although the distribution looks wide at first, the range of values is fairly narrow, from about $55,000 to $80,000. Both of these observations tell us that our model is generalizable to new data, and the MAE across all folds indicates that our model is quite powerful.

Table #: Cross Validation Results
=================================
Variable          Value
---------------------------------
Mean of MAE       66,111.096
SD of MAE         5,776.314
=================================

3.3 Single Test Statistics and Scatterplot

The following is a description of a chosen fold in our cross-validation test. Fold 8 is comparable to the average of our cross-validation results. It is a glimpse of the many folds that make up our final histogram of cross-validation results.
Table #: Regression Results of One Test Set
===========================================
MAE            MAPE
-------------------------------------------
69,521.021     0.388
===========================================

3.4 Predicted prices as a function of observed prices

We created a scatterplot with our predicted values on the y-axis and the sale price values on the x-axis for each observation in our test data. The orange line represents the line of best fit for a perfect prediction. The green line of best fit is the result of our model. The scatterplot hints that our model struggles to accurately predict homes that sell for at least $750K.

3.5 Residuals Map, Moran’s I and Spatial Lag of Errors

The first map is of our residuals, the errors of our model in the test set. Here we are looking for pockets of concentrated values to determine if there is spatial autocorrelation. We observe in this map that there is a general mix of errors across space, suggesting that factors that are affecting our model’s accuracy may not be due to spatial processes. However, this alone is not enough to determine whether or not there is a spatial component influencing our errors. A more accurate test is to calculate Moran’s I, which we perform next.

3.6 Global Moran’s I and Spatial Lag of Price Errors

Our calculation of Moran's I measures the significance of the spatial distribution of our errors in the test set. It creates 999 random permutations of the errors and measures Moran's I for each of them to build a reference distribution. It then calculates a p-value for our observed statistic under the null hypothesis that the errors are randomly distributed, indicating whether or not what we observe is due to spatial autocorrelation.

In the case of our data, our observed Moran’s I is greater than the vast majority of the permutations. This confirms that there is some spatial autocorrelation in our errors, leading us to go back to the drawing board and consider what features we could engineer with pre-existing data to account for a particular spatial process that we have not accounted for yet.

The spatial lag is a measurement of how observations influence each other in space. In our model, we use it to note how the price of one sale may affect nearby sales. It is calculated as the average sale price of a set number of "neighbors," the observations nearest each home sale. We can do the same with our errors. We present a plot of the spatial lag of errors against sale price errors: there is a marginal correlation, with the spatial lag of errors increasing as sale price errors increase. Like the Moran's I test, this plot implies that there are spatial processes that have gone unaccounted for in our model.

3.7 Mapping Predicted Values for All Homes

Despite our errors exhibiting spatial autocorrelation, when we compare the map of actual home sales with our map of predicted prices, it is difficult to identify clear differences in space. This comparison highlights the importance of performing statistical analyses on our errors, such as Moran's I and MAE.

3.8 MAPE by Neighborhood

Since it is hard to distinguish where our errors arise in the previous maps, we visualized the MAPE by neighborhood. Mapping errors is an effective way to test and think through why they cluster in space. By identifying where the errors occur in the city, we can investigate what dynamics are unique to those areas and consider how best to account for them in our model.

Although our model used ICE to account for segregation, perhaps it is unable to properly account for the manifold consequences of segregation that lead to disparities in home sale prices. As seen in the previous scatterplots, higher home sale prices rarely occurred where there were more deprived households than privileged ones. Perhaps with data describing the distribution of vacant lots or abandoned properties, or even the ratio of public to private school students within an area, our model would become more accurate and better predict home prices in neighborhoods such as North Philadelphia.

Beyond spatially analyzing MAPE values, we can also look at their distribution in a scatterplot. The map shows where in space our errors occur, but a scatterplot of MAPE as a function of mean price per census tract shows for what range of values we observe the highest percent errors. Given where on the map we are getting our highest errors, it is not surprising that our largest MAPE values are at the lower end of sale prices. This part of the testing process informs what the next geospatial feature we engineer should be: namely, something that reduces errors at this end of the price range. One such factor could account for the relationship between the quality of nearby schools and their distance to each sale. Accounting for such a factor could make our model more accurate.

3.9 Testing for Generalizability

In order to test our model for generalizability, we needed to split our study area into two groups and assess the model's accuracy in each group. We decided to split by the white percentage of census tract populations. Whiteness defines a widely generalizable gradient of sociopolitical power across the US, and particularly in Philadelphia, where there are also huge economic disparities between white and non-white populations. Because white people comprise about one-third of Philadelphia's population, we split our census tracts at 33%. That is, one partition ("MoreWhite") consists of census tracts with white percentages over 33%, and the other ("LessWhite") contains all other census tracts.

Our regression results for each partition are shown below. The MAPE for the MoreWhite partition is lower than that of the LessWhite partition, although its MAE is higher, since the same percentage error translates into more dollars in higher-priced tracts. Based on these results, we suggest that our model may be underperforming in areas with lower proportions of white people.

Regression Results for MoreWhite
================================
MAE           MAPE
--------------------------------
77,772.53     0.272
================================

Regression Results for LessWhite
================================
MAE           MAPE
--------------------------------
55,244.51     0.515
================================
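Computing these per-partition metrics amounts to a grouped aggregation. The snippet below is an illustrative sketch with hypothetical sales, not our actual test set (and it follows this report's convention of dividing percent error by the predicted value):

```python
import pandas as pd

# Hypothetical test-set results with a partition label per tract;
# values are made up purely to show the aggregation.
results = pd.DataFrame({
    "partition": ["MoreWhite", "MoreWhite", "LessWhite", "LessWhite"],
    "observed":  [400_000., 500_000., 100_000., 150_000.],
    "predicted": [420_000., 480_000., 140_000., 120_000.],
})
results["abs_error"] = (results["predicted"] - results["observed"]).abs()
results["ape"] = results["abs_error"] / results["predicted"]

# Mean absolute error and mean absolute percent error per partition.
by_group = results.groupby("partition")[["abs_error", "ape"]].mean()
```

Even in this toy example the pattern described above appears: the lower-priced partition can have a smaller dollar error while its percent error is much larger.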

Based on the following scatterplots, we see again that our model produces smaller percentage errors for the MoreWhite partition than for LessWhite. Again, the orange line represents a perfect prediction, while our green line represents the line of best fit for our model. Because sale prices in the MoreWhite partition are higher, some individual errors are large enough in dollar terms that its MAE exceeds that of LessWhite. Despite that, the average percent error in the LessWhite partition is nearly double that of MoreWhite. Overall, this tells us that our model is deficient and fails our test of generalizability to different contexts such as race.

Below are maps of the mean MAPE by neighborhood for the LessWhite and MoreWhite partitions. We first see that there is a clear spatial element to the distribution of race in Philadelphia. The black dots on the maps represent the location of each observation used in the associated partition. These maps serve a diagnostic purpose. In the case of MoreWhite, it calls us to consider the northwest part of Fishtown: what could be happening there that drives the abnormally high MAPE in our map? In the LessWhite partition, where the scatterplot already showed that we are failing to predict accurately, there is considerable error in the Kensington/North Philadelphia area, as well as in Southwest Philadelphia. This suggests that, in order to have a model that generalizes across racial contexts, whatever feature we develop next should not be neighborhood-specific but relevant at a larger scale. Our errors are spread across space, so our next feature must have a relationship with sale price that captures some nuance about the context of race in Philadelphia. These maps reflect our findings that our model performs more accurately in more privileged neighborhoods than in relatively deprived ones.

4 Discussion and Conclusion

We started this model-building process because we wanted to improve Zillow's Zestimate home price prediction model. We posited that understanding local contexts and processes would improve the model, because home prices cluster at a variety of spatial scales and are therefore hard to predict. The job of our model is to capture those scales using available data and engineered spatial features that affect home sale transactions.

As geospatial data analysts, we considered what data we could bring in and how it could affect sale price, and then began to explore, test, and engineer features to see if they could explain the sale prices that we observed in our data. People purchase homes for a multitude of reasons, such as perceived safety, distance to amenities, size of the home, and changes in the local built environment. We accounted for some of these factors using engineered features such as the count of gunshot wound reports within a quarter mile of a home, the count of trees per square mile, and the average distance to the nearest 5 new construction building permits.

The building permits feature was interesting because it relates a home sale to sanctioned neighborhood change. New construction represents investment in an area, and it indicates where the city has an interest in approving building permits. One reason we wanted to use building permits was that they might predict prices in areas with high price variability, namely places where there is an enlarging rent gap.

Another feature we found useful was the gunshot wound reports. Crime data is not always accurately maintained because reporting is subject to biases; for instance, not all neighborhoods report petty theft or crime. However, the critical nature of a gunshot wound necessitates assistance from public health and/or law enforcement agencies. We feel confident that there is little reporting bias for these incidents, particularly for fatal shootings. Therefore, the count of shooting victims is a reliable measure of gun violence, which we posited would influence sale prices. Both the new construction permits and the gunshot wound data were significant predictors of sale prices, with p-values < .01. Other variables were unsurprising but worth noting: the number of bedrooms, the size of the living area, the median rent of the area, and the distance to the nearest commercial corridor.

Our process follows the machine learning framework used to create a model that predicts home sale prices in a way that is both accurate and generalizable to new contexts. Once we chose our explanatory variables, we performed an OLS regression of sale price on a portion of our data (the training set) to obtain a line-of-best-fit regression equation summarizing the relationship between our variables and the observed sale prices. The \(R^2\) of the training-set regression tells us how much of the variability in observed sale price our model explains; in this case, about 72%. We then used our regression equation to predict sale price in our test set, which the model had not seen before. Our mean absolute error (MAE) was about $66,000, meaning our model over- or under-estimated sale prices by that amount on average. The model is reasonably effective at predicting sale price, but we were not fully satisfied with the results.

In particular, we were not able to fully account for spatial variation in the data. If our Moran’s I test on the errors had been close to 0, it would have indicated little to no spatial autocorrelation; because it was not, there remain spatial processes we have not accounted for, leading to clustering of our errors. Notably, when comparing our predicted sale price map with our observed sale price map, it is difficult to spot where those errors are. This is why we must run Moran’s I on the errors: judging spatial autocorrelation by an eye test alone would ultimately fail.
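The fit-then-evaluate loop above can be sketched end to end. This is a numpy-only sketch on synthetic data; the 72% \(R^2\), the $66,000 MAE, and our actual Moran's I value come from the real Philadelphia data, not from this toy example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for our real features and projected home coordinates.
n = 500
X = rng.normal(size=(n, 3))
coords = rng.uniform(size=(n, 2))
y = 250_000 + X @ np.array([40_000.0, 25_000.0, -15_000.0]) \
    + rng.normal(scale=60_000, size=n)

# 60/40 train/test split.
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]

# OLS via least squares, with an intercept column.
A = np.column_stack([np.ones(train.size), X[train]])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Training-set R^2: share of variance in observed price explained by the fit.
fitted = A @ coef
r2 = 1 - np.sum((y[train] - fitted) ** 2) / np.sum((y[train] - y[train].mean()) ** 2)

# Test-set MAE: average dollar amount by which predictions miss.
pred = np.column_stack([np.ones(test.size), X[test]]) @ coef
mae = np.mean(np.abs(y[test] - pred))

def morans_i(x, xy, k=8):
    """Moran's I with binary k-nearest-neighbour spatial weights."""
    z = x - x.mean()
    d = np.linalg.norm(xy[:, None] - xy[None, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude self-neighbours
    nbrs = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours of each point
    W = np.zeros((len(x), len(x)))
    W[np.repeat(np.arange(len(x)), k), nbrs.ravel()] = 1.0
    return (len(x) / W.sum()) * (z @ W @ z) / (z @ z)

# Moran's I on the test-set errors: values near 0 mean errors are not clustered.
i_errors = morans_i(y[test] - pred, coords[test])
```

Because these synthetic errors have no spatial structure, `i_errors` lands near 0 here; on our real residuals the statistic was far enough from 0 to flag clustering.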

Our model predicted sale prices in “MoreWhite” areas of the city particularly well. In “LessWhite” areas, dominated by sale prices at the lower end of the range and where there is also more poverty, it did not perform well. Although the standard deviation of errors in our cross-validation test was low, meaning the model generalizes well to new data, we were disappointed when we tested it across racial contexts: there was a stark difference between the accuracy in the MoreWhite partition (MAPE of 27%) and the LessWhite partition (MAPE of 52%). One possible reason is that in lower-income neighborhoods like Mantua and Kingsessing, people are “flipping” properties, raising sale prices to a standard the area is not used to seeing. This parcel-level variability makes it difficult for our model to predict accurately in these areas. Although we tried to capture this process with the new construction permit data, that feature may not be sufficient on its own. Other permit types in the data, such as plumbing permits, might be better indicators, since the housing stock’s exterior is sometimes retained while all of the capital injected into the property goes inside the home.
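Computing MAPE separately for each partition is a small grouped aggregation. This pandas sketch uses made-up prices, and the column names are hypothetical rather than our actual schema:

```python
import pandas as pd

# Toy observed/predicted prices tagged with the racial-context partition.
df = pd.DataFrame({
    "observed":  [300_000, 250_000, 120_000, 90_000],
    "predicted": [280_000, 260_000, 170_000, 60_000],
    "partition": ["MoreWhite", "MoreWhite", "LessWhite", "LessWhite"],
})

# Absolute percentage error per sale, then the mean within each partition.
df["ape"] = (df["predicted"] - df["observed"]).abs() / df["observed"]
mape_by_group = df.groupby("partition")["ape"].mean()
```

The real comparison in the text (27% vs. 52%) came from exactly this kind of per-partition aggregation over our test-set predictions.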

To conclude, we set out to use existing data and new geospatial features that account for spatial processes at the local level to improve on Zillow’s Zestimate model. Although we cannot say whether our model is better than Zillow’s, we can say that it accounts for much of the local context that influences home sale prices. Our model is far from perfect. It could benefit from a reworking of the permit data and from the inclusion of data on schools, jobs in the area, and the transportation network, to name a few factors missing from our regression. We would also need to investigate what is going on in the neighborhoods where we see our largest errors. We believe the best way to do this is through ground-truthing: going to these places and getting to know their people and built environment. Doing so would deepen our local understanding of the spatial processes that affect and define these areas, and it would inform our next set of engineered features. Incorporating these improvements should yield a more accurate model that generalizes to multiple contexts and lacks clusters of errors.
